As we have seen earlier, genomic scores are often stored in wiggle format.
A perhaps more common human readable format is bedGraph.
Genomic Scores are heavily used in Genomics and High throughput sequencing as they offer a simple mechanism to review a defined metric over the linear genome at a specified resolution.
From our last session we identified Myc peaks within the Igfbp2 locus and in IGV compared Myc ChIP-seq signal from Encode over our peaks.
Two popular Bioconductor packages for dealing with Genomics Scores are:
Now we have the package installed, we can load the library rtracklayer which we will use to import and export from/to bedGraph and bigWig.
library(rtracklayer)We will also be making use of the functions in the GenomicRanges package. We dont need to load GenomicRanges directly here because the rtracklayer does this for us.
Package dependencies and imports allow one package to make use of functions from another.
The rtracklayer package provides functions to import genomic scores from a bedGraph using the import.bedGraph() function.
myBedG <- import.bedGraph("../../Data/TSS_ENCFF940MBK.bedGraph")Because we only have 4 columns in a bedGraph and no strand information, the GRanges intervals are unstranded with * in their strand column
strand(myBedG)## factor-Rle of length 2161 with 1 run
## Lengths: 2161
## Values : *
## Levels(3): + - *
The import bigWig’s genomic scores are again imported as a GRanges object containing the same information as the imported bedGraph.
myBigWig[1:3]## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 [ 1, 72811054] * | 0
## [2] chr1 [72811055, 72811119] * | 0.690400004386902
## [3] chr1 [72811120, 72811145] * | 0.578019976615906
## -------
## seqinfo: 54 sequences from an unspecified genome
We can however specify the type of objects we would like to return from the import.bedGraph and import.bw functions.
Here we will import the bigWig as a object we have briefly seen, the Rle object (run length encoding). Here we have an Rlelist (a list of Rle objects)
myBigWig <- import.bw("../../Data/TSS_ENCFF940MBK.bw",
as = "RleList")
class(myBigWig)## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
Run length encoding allows for a very efficient storage of long stretchs of repeated values.
We have already seen an rle in our cigar string from SAM files.
Now can construct a named RleList containing the Rle objects using the RleList() function.
myNumbers2 <- c(0,0,0,0,0,1,1,1,2,2,2,2,2)
chr1Scores <- Rle(myNumbers)
chr2Scores <- Rle(myNumbers2)
myRleList <- RleList(chr1=chr1Scores,chr2=chr2Scores)
myRleList## RleList of length 2
## $chr1
## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 0
##
## $chr2
## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 2
To access elements of a RleList we can use our regular accessors for lists $ and [[]].
Here we retrieve the Rle object named chr1 containing all the genomic scores information for chromosome 1.
chr1_rle <- myBigWig$chr1
# Or
chr1_rle <- myBigWig[["chr1"]]
chr1_rle## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 ... 122614997
## Values : 0 0.690400004386902 ... 0
We can also replace values in a Rle, just as we would with a vector.
Here i replace values for all basepairs between 1 and 10 to 100
chr1_rle[1:10] <- 100
chr1_rle## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 ... 122614997
## Values : 100 0 ... 0
Or to a data.frame
rleAsDF <- as.data.frame(chr1_rle[1:10])
rleAsDF## value
## 1 100
## 2 100
## 3 100
## 4 100
## 5 100
## 6 100
## 7 100
## 8 100
## 9 100
## 10 100
We can use many simple arithmetric operations such as +, -, / and *****
chr1_rle+1000## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 ... 122614997
## Values : 1100 1000 ... 1000
(chr1_rle+1000)*10## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 ... 122614997
## Values : 11000 10000 ... 10000
We use this logical Rle to replace values less than 10 with 0 as we would use a logical vector with standard vectors.
chr1_rle[chr1_rle < 10] <- 0
chr1_rle## numeric-Rle of length 195471971 with 122 runs
## Lengths: 10 72823081 ... 122627007
## Values : 100 0 ... 0
Very usefully, We can also apply arithmetic and mathematical operations to whole RleLists as imported from the bigWig file.
myBigWig <- import.bw("../../Data/TSS_ENCFF940MBK.bw",as="RleList")
myBigWig+10## RleList of length 54
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 ... 122614997
## Values : 10 10.6904000043869 ... 10
##
## $chr10
## numeric-Rle of length 130694993 with 1 run
## Lengths: 130694993
## Values : 10
##
## $chr11
## numeric-Rle of length 122082543 with 1 run
## Lengths: 122082543
## Values : 10
##
## $chr12
## numeric-Rle of length 120129022 with 1 run
## Lengths: 120129022
## Values : 10
##
## $chr13
## numeric-Rle of length 120421639 with 1 run
## Lengths: 120421639
## Values : 10
##
## ...
## <49 more elements>
We can in fact use GRanges objects to index our RleList objects.The GRanges provides the intervals from which genomic scores are retrieved. The resulting RleList object contains an entry with the scores for each interval in the GRanges object.
To demonstrate first we can retrieve the Myc Peaks calls which overlap the region we are reviewing.
myRanges <- GRanges("chr1",ranges = IRanges(72811055,72856974))
mycPeaks <- import.bed("../../Data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
mycPeaks <- resize(mycPeaks,50,fix="center")
newMycPeaks <- mycPeaks[mycPeaks %over% myRanges]
newMycPeaks## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [72824021, 72824070] * |
## [2] chr1 [72844876, 72844925] * |
## name score
## <character> <numeric>
## [1] Myc_Ch12_1_withInput_Input_Ch12_peak_246 18.55062
## [2] Myc_Ch12_1_withInput_Input_Ch12_peak_247 9.27569
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
With the RleList containing our scores over the Myc peaks we can now gather summary statistics as with all RleList objects
sum(rleOverGranges)## chr1 chr1
## 1009.0643 480.4242
Now we have our RleList object we export this to a bigWig using the export.bw() function.
export.bw(myRleList,con="chr1_Myc.bw")When importing only a portion of a bigWig we simply need to specify a GRanges of regions we wish to retrieve to the BigWigSelection() function.
Here we will use the Myc Peaks in our window as the GRanges of regions for selection.
newMycPeaks## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [72824021, 72824070] * |
## [2] chr1 [72844876, 72844925] * |
## name score
## <character> <numeric>
## [1] Myc_Ch12_1_withInput_Input_Ch12_peak_246 18.55062
## [2] Myc_Ch12_1_withInput_Input_Ch12_peak_247 9.27569
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths